Project 3
How Economic Prosperity Shapes Global Life Expectancy
Name: Zenia Clarissa Bhaswata Putri
UNI: zc2709
Introduction
Objective: Investigate whether higher GDP per capita is associated with better health outcomes by analyzing the relationship between GDP per capita and life expectancy across countries, focusing on 2021, the latest year available in both datasets.
Hypothesis: In 2021, countries with higher GDP per capita are expected to have higher life expectancy due to better healthcare access, nutrition, and standards of living.
Analysis Question: Is there a positive correlation between GDP per capita and life expectancy across countries in 2021, suggesting that economic prosperity leads to improved healthcare access, nutrition, and standards of living?
Datasets to be used: I am using the GDP per capita dataset from the World Bank and the life expectancy rates from the WHO. The data covers 179 countries for the years 2012 to 2021. For the final visualization and analysis I chose 2021 because it is the latest year for which both the GDP per capita and life expectancy datasets are available. It is like working with the freshest ingredients: you get the most up-to-date snapshot of global trends. Using the same year for both datasets also keeps the comparison consistent, making the analysis more reliable and meaningful.
Dataset links:
GDP per capita data from the World Bank.
Life expectancy rates data from the WHO.
Step 1: Load and Inspect Datasets
Import necessary Python packages (pandas for data handling)
Load both datasets into separate dataframes and inspect the structure to understand columns and missing data
I began by importing the necessary packages: pandas and plotly. Pandas is used for data manipulation, while plotly is used for creating visualizations.
import plotly.io as pio
pio.renderers.default = "vscode+jupyterlab+notebook_connected"
import pandas as pd
import plotly.express as px
Next, I read the main dataframe for economic prosperity (GDP_percapita_data.csv) into the notebook to start exploring the data. I displayed the first few rows of each dataset to verify that it had been read correctly and to understand its structure. After that, I applied some basic functions to check the contents of key columns and look for missing values. I also listed the unique country codes in the dataset to prepare for merging with the WHO life expectancy data. This initial exploration and cleaning step is crucial to ensure that the data is ready for analysis and visualization.
gdp_percapita = pd.read_csv("GDP_percapita_data.csv")
gdp_percapita.head()
| | Series Name | Series Code | Country Name | Country Code | 2008 [YR2008] | 2012 [YR2012] | 2013 [YR2013] | 2014 [YR2014] | 2015 [YR2015] | 2016 [YR2016] | 2017 [YR2017] | 2018 [YR2018] | 2019 [YR2019] | 2020 [YR2020] | 2021 [YR2021] | 2022 [YR2022] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GDP per capita (current US$) | NY.GDP.PCAP.CD | Afghanistan | AFG | 382.5338072 | 653.4174749 | 638.733181 | 626.5129291 | 566.8811297 | 523.053012 | 526.140801 | 492.090631 | 497.7414313 | 512.055098 | 355.7778264 | 352.6037331 |
| 1 | GDP per capita (current US$) | NY.GDP.PCAP.CD | Albania | ALB | 4370.539716 | 4247.631343 | 4413.063383 | 4578.633208 | 3952.803574 | 4124.05539 | 4531.032207 | 5287.660801 | 5396.214243 | 5343.037704 | 6377.203096 | 6810.114041 |
| 2 | GDP per capita (current US$) | NY.GDP.PCAP.CD | Algeria | DZA | 5217.991822 | 6096.090015 | 6044.674903 | 6164.644699 | 4741.49977 | 4481.081962 | 4615.868744 | 4640.314145 | 4530.101745 | 3794.409524 | 4216.251285 | 5023.252932 |
| 3 | GDP per capita (current US$) | NY.GDP.PCAP.CD | American Samoa | ASM | 10019.50225 | 11920.06109 | 12038.87159 | 12313.99736 | 13101.54182 | 13300.82461 | 12372.88478 | 13195.9359 | 13672.57666 | 15609.77722 | 16653.71378 | 19673.3901 |
| 4 | GDP per capita (current US$) | NY.GDP.PCAP.CD | Andorra | AND | 53938.85213 | 44902.38077 | 44747.75386 | 45680.53499 | 38885.53032 | 39931.21698 | 40632.23155 | 42904.82846 | 41328.6005 | 37207.222 | 42066.49052 | 42350.69707 |
gdp_percapita.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217 entries, 0 to 216
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Series Name 217 non-null object
1 Series Code 217 non-null object
2 Country Name 217 non-null object
3 Country Code 217 non-null object
4 2008 [YR2008] 217 non-null object
5 2012 [YR2012] 217 non-null object
6 2013 [YR2013] 217 non-null object
7 2014 [YR2014] 217 non-null object
8 2015 [YR2015] 217 non-null object
9 2016 [YR2016] 217 non-null object
10 2017 [YR2017] 217 non-null object
11 2018 [YR2018] 217 non-null object
12 2019 [YR2019] 217 non-null object
13 2020 [YR2020] 217 non-null object
14 2021 [YR2021] 217 non-null object
15 2022 [YR2022] 217 non-null object
dtypes: object(16)
memory usage: 27.3+ KB
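The info() output above reports every year column as object with no nulls, which suggests that missing values are stored as text rather than as real NaN. A minimal sketch of how the hidden gaps could be counted, using a made-up toy frame and assuming the placeholder is the string ".." (an assumption, since the raw file contents are not shown here):

```python
import pandas as pd

# Toy frame mimicking the World Bank layout; the values are illustrative only.
# ASSUMPTION: missing values are stored as the text placeholder "..".
toy = pd.DataFrame({
    "Country Code": ["XAA", "XBB", "XCC"],  # hypothetical codes
    "2020 [YR2020]": ["512.05", "..", "3794.40"],
    "2021 [YR2021]": ["355.77", "..", ".."],
})

year_cols = [col for col in toy.columns if "[YR" in col]

# Anything that does not parse as a number becomes NaN, so it can be counted
as_numbers = toy[year_cols].apply(pd.to_numeric, errors="coerce")
missing_per_year = as_numbers.isna().sum()
print(missing_per_year)
```

Running the same count on gdp_percapita would show how many countries lack a value in each year before any filtering is done.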
To confirm that the GDP per capita data uses ISO 3-letter country codes, I list the unique values in the Country Code column:
gdp_percapita["Country Code"].unique()
array(['AFG', 'ALB', 'DZA', 'ASM', 'AND', 'AGO', 'ATG', 'ARG', 'ARM',
'ABW', 'AUS', 'AUT', 'AZE', 'BHS', 'BHR', 'BGD', 'BRB', 'BLR',
'BEL', 'BLZ', 'BEN', 'BMU', 'BTN', 'BOL', 'BIH', 'BWA', 'BRA',
'VGB', 'BRN', 'BGR', 'BFA', 'BDI', 'CPV', 'KHM', 'CMR', 'CAN',
'CYM', 'CAF', 'TCD', 'CHI', 'CHL', 'CHN', 'COL', 'COM', 'COD',
'COG', 'CRI', 'CIV', 'HRV', 'CUB', 'CUW', 'CYP', 'CZE', 'DNK',
'DJI', 'DMA', 'DOM', 'ECU', 'EGY', 'SLV', 'GNQ', 'ERI', 'EST',
'ETH', 'FRO', 'FJI', 'FIN', 'FRA', 'PYF', 'GAB', 'GMB', 'GEO',
'DEU', 'GHA', 'GIB', 'GRC', 'GRL', 'GRD', 'GUM', 'GTM', 'GIN',
'GNB', 'GUY', 'HTI', 'HND', 'HKG', 'HUN', 'ISL', 'IND', 'IDN',
'IRN', 'IRQ', 'IRL', 'IMN', 'ISR', 'ITA', 'JAM', 'JPN', 'JOR',
'KAZ', 'KEN', 'KIR', 'PRK', 'KOR', 'XKX', 'KWT', 'KGZ', 'LAO',
'LVA', 'LBN', 'LSO', 'LBR', 'LBY', 'LIE', 'LTU', 'LUX', 'MAC',
'MKD', 'MDG', 'MWI', 'MYS', 'MDV', 'MLI', 'MLT', 'MHL', 'MRT',
'MUS', 'MEX', 'FSM', 'MDA', 'MCO', 'MNG', 'MNE', 'MAR', 'MOZ',
'MMR', 'NAM', 'NRU', 'NPL', 'NLD', 'NCL', 'NZL', 'NIC', 'NER',
'NGA', 'MNP', 'NOR', 'OMN', 'PAK', 'PLW', 'PAN', 'PNG', 'PRY',
'PER', 'PHL', 'POL', 'PRT', 'PRI', 'QAT', 'ROU', 'RUS', 'RWA',
'WSM', 'SMR', 'STP', 'SAU', 'SEN', 'SRB', 'SYC', 'SLE', 'SGP',
'SXM', 'SVK', 'SVN', 'SLB', 'SOM', 'ZAF', 'SSD', 'ESP', 'LKA',
'KNA', 'LCA', 'MAF', 'VCT', 'SDN', 'SUR', 'SWZ', 'SWE', 'CHE',
'SYR', 'TJK', 'TZA', 'THA', 'TLS', 'TGO', 'TON', 'TTO', 'TUN',
'TUR', 'TKM', 'TCA', 'TUV', 'UGA', 'UKR', 'ARE', 'GBR', 'USA',
'URY', 'UZB', 'VUT', 'VEN', 'VNM', 'VIR', 'PSE', 'YEM', 'ZMB',
'ZWE'], dtype=object)
Since the GDP per capita data already has all the columns I need, there's no point in removing anything at this stage. Every column included seems relevant to the questions I'm looking into.
Next, I do the same with the WHO data:
life_exp = pd.read_csv("WHO_data3.csv")
life_exp.head()
| | IndicatorCode | Indicator | ValueType | ParentLocationCode | ParentLocation | Location type | SpatialDimValueCode | Location | Period type | Period | ... | FactValueUoM | FactValueNumericLowPrefix | FactValueNumericLow | FactValueNumericHighPrefix | FactValueNumericHigh | Value | FactValueTranslationID | FactComments | Language | DateModified |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WHOSIS_000001 | Life expectancy at birth (years) | text | AFR | Africa | Country | LSO | Lesotho | Year | 2021 | ... | NaN | NaN | 50.49 | NaN | 52.57 | 51.5 [50.5-52.6] | NaN | NaN | EN | 2024-08-02T04:00:00.000Z |
| 1 | WHOSIS_000001 | Life expectancy at birth (years) | text | AFR | Africa | Country | CAF | Central African Republic | Year | 2021 | ... | NaN | NaN | 51.06 | NaN | 53.36 | 52.3 [51.1-53.4] | NaN | NaN | EN | 2024-08-02T04:00:00.000Z |
| 2 | WHOSIS_000001 | Life expectancy at birth (years) | text | EMR | Eastern Mediterranean | Country | SOM | Somalia | Year | 2021 | ... | NaN | NaN | 52.92 | NaN | 55.11 | 54.0 [52.9-55.1] | NaN | NaN | EN | 2024-08-02T04:00:00.000Z |
| 3 | WHOSIS_000001 | Life expectancy at birth (years) | text | AFR | Africa | Country | SWZ | Eswatini | Year | 2021 | ... | NaN | NaN | 53.49 | NaN | 55.87 | 54.6 [53.5-55.9] | NaN | NaN | EN | 2024-08-02T04:00:00.000Z |
| 4 | WHOSIS_000001 | Life expectancy at birth (years) | text | AFR | Africa | Country | MOZ | Mozambique | Year | 2021 | ... | NaN | NaN | 56.64 | NaN | 58.77 | 57.7 [56.6-58.8] | NaN | NaN | EN | 2024-08-02T04:00:00.000Z |
5 rows × 34 columns
life_exp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1850 entries, 0 to 1849
Data columns (total 34 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 IndicatorCode 1850 non-null object
1 Indicator 1850 non-null object
2 ValueType 1850 non-null object
3 ParentLocationCode 1850 non-null object
4 ParentLocation 1850 non-null object
5 Location type 1850 non-null object
6 SpatialDimValueCode 1850 non-null object
7 Location 1850 non-null object
8 Period type 1850 non-null object
9 Period 1850 non-null int64
10 IsLatestYear 1850 non-null bool
11 Dim1 type 1850 non-null object
12 Dim1 1850 non-null object
13 Dim1ValueCode 1850 non-null object
14 Dim2 type 0 non-null float64
15 Dim2 0 non-null float64
16 Dim2ValueCode 0 non-null float64
17 Dim3 type 0 non-null float64
18 Dim3 0 non-null float64
19 Dim3ValueCode 0 non-null float64
20 DataSourceDimValueCode 0 non-null float64
21 DataSource 0 non-null float64
22 FactValueNumericPrefix 0 non-null float64
23 FactValueNumeric 1850 non-null float64
24 FactValueUoM 0 non-null float64
25 FactValueNumericLowPrefix 0 non-null float64
26 FactValueNumericLow 1847 non-null float64
27 FactValueNumericHighPrefix 0 non-null float64
28 FactValueNumericHigh 1847 non-null float64
29 Value 1850 non-null object
30 FactValueTranslationID 0 non-null float64
31 FactComments 0 non-null float64
32 Language 1850 non-null object
33 DateModified 1850 non-null object
dtypes: bool(1), float64(17), int64(1), object(15)
memory usage: 478.9+ KB
Step 2: Data Cleaning & Pre-processing
Handle missing or invalid data by filtering or imputing values
Reshaping the data format (columns/rows)
Standardize country names across datasets (using ISO 3-letter country codes)
Ensure time periods align (e.g. in this case we want to look at data in 2021)
Cleaning and pre-processing GDP percapita data
Columns that will likely be used:
To prepare the GDP per capita dataset for analysis, I selected the necessary columns and reshaped the data. The original format had one column per year for each country, so I reshaped it into a tidy (long) format with one row per country and year, matching the layout of the life expectancy data. I also extracted the numeric year from the [YRxxxx] column labels and converted the GDP values to numbers. These steps ensured the dataset was organized, consistent, and ready for analysis alongside the life expectancy data.
# Cleaning and preprocessing GDP data
# Selecting necessary columns and reshaping the data
gdp_cleaned = gdp_percapita.melt(
id_vars=["Country Name", "Country Code"],
value_vars=[col for col in gdp_percapita.columns if "[YR" in col],
var_name="Year",
value_name="GDP per Capita"
)
# Extract the year from column names
gdp_cleaned["Year"] = gdp_cleaned["Year"].str.extract(r"(\d{4})").astype(int)
gdp_cleaned["GDP per Capita"] = pd.to_numeric(gdp_cleaned["GDP per Capita"], errors="coerce")
# Displaying the cleaned World Bank dataframe
gdp_cleaned.head()
| | Country Name | Country Code | Year | GDP per Capita |
|---|---|---|---|---|
| 0 | Afghanistan | AFG | 2008 | 382.533807 |
| 1 | Albania | ALB | 2008 | 4370.539716 |
| 2 | Algeria | DZA | 2008 | 5217.991822 |
| 3 | American Samoa | ASM | 2008 | 10019.502250 |
| 4 | Andorra | AND | 2008 | 53938.852130 |
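As a final check, I strip any leftover [YRxxxx] suffix from the column names. After the reshape this changes nothing, because the year labels already live in the Year column as plain integers.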
# Cleaning the GDP per capita data to remove '[YRxxxx]' from year columns
gdp_cleaned.columns = gdp_cleaned.columns.str.replace(r'\[YR\d{4}\]', '', regex=True).str.strip()
gdp_cleaned.head()
| | Country Name | Country Code | Year | GDP per Capita |
|---|---|---|---|---|
| 0 | Afghanistan | AFG | 2008 | 382.533807 |
| 1 | Albania | ALB | 2008 | 4370.539716 |
| 2 | Algeria | DZA | 2008 | 5217.991822 |
| 3 | American Samoa | ASM | 2008 | 10019.502250 |
| 4 | Andorra | AND | 2008 | 53938.852130 |
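After a wide-to-long reshape like this one, a quick sanity check is that the row count equals countries times years and that no (Country Code, Year) pair appears twice. A small sketch on a made-up two-country table (the codes, names, and values are hypothetical, and ".." as the missing-value placeholder is an assumption):

```python
import pandas as pd

# Illustrative wide table: two hypothetical countries, two year columns
wide = pd.DataFrame({
    "Country Name": ["Country A", "Country B"],
    "Country Code": ["XAA", "XBB"],
    "2020 [YR2020]": ["100.0", "200.0"],
    "2021 [YR2021]": ["110.0", ".."],
})

year_cols = [col for col in wide.columns if "[YR" in col]

# Same reshape as in the cleaning step: one row per country and year
long = wide.melt(
    id_vars=["Country Name", "Country Code"],
    value_vars=year_cols,
    var_name="Year",
    value_name="GDP per Capita",
)
long["Year"] = long["Year"].str.extract(r"(\d{4})", expand=False).astype(int)
long["GDP per Capita"] = pd.to_numeric(long["GDP per Capita"], errors="coerce")

# Checks: expected size, and unique country-year keys
expected_rows = len(wide) * len(year_cols)
has_duplicates = long.duplicated(subset=["Country Code", "Year"]).any()
print(len(long) == expected_rows, has_duplicates)
```

For the real data the expected size would be 217 countries times 12 year columns, which is consistent with the row labels in the 2300s seen later in the 2021 slice.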
Cleaning and pre-processing Life Expectancy data
Columns that will likely be used:
After inspecting the contents of the WHO data, I identified several columns that were not relevant to the questions I am investigating. Columns such as "FactComments", "FactValueTranslationID", and others contained additional metadata that did not contribute to my analysis of GDP per capita and life expectancy. Since I mainly needed the "SpatialDimValueCode", "Period", and "FactValueNumeric" columns to link location, year, and life expectancy, I dropped the metadata columns and kept "Location", "Period type", and "IsLatestYear" for reference.
Additionally, I renamed the remaining columns to make the dataset more intuitive. For instance, I simplified "SpatialDimValueCode" to "Country Code", "Period" to "Year", and "FactValueNumeric" to "Life Expectancy". This step helped streamline the dataframe for analysis and visualization, ensuring it contained only the necessary and meaningful data.
# Re-assigning for this isolated environment
life_exp = pd.read_csv('WHO_data3.csv')
# Dropping unnecessary columns
columns_to_drop = [
"IndicatorCode", "Indicator", "ValueType", "ParentLocation",
"Location type", "ParentLocationCode", "Dim1 type", "Dim1",
"Dim1ValueCode", "Dim2 type", "Dim2", "Dim2ValueCode",
"Dim3 type", "Dim3", "Dim3ValueCode", "DataSourceDimValueCode",
"DataSource", "FactValueNumericPrefix", "FactValueUoM",
"FactValueNumericLowPrefix", "FactValueNumericLow",
"FactValueNumericHighPrefix", "FactValueNumericHigh",
"Value", "FactValueTranslationID", "FactComments",
"Language", "DateModified"
]
# Dropping the columns
life_exp.drop(columns=columns_to_drop, axis=1, inplace=True)
# Renaming columns for clarity
life_exp.rename(columns={
"SpatialDimValueCode": "Country Code",
"Period": "Year",
"FactValueNumeric": "Life Expectancy"
}, inplace=True)
# Storing the WHO cleaned dataframe
life_exp_cleaned = life_exp
# Displaying the WHO cleaned dataframe
life_exp_cleaned.head()
| | Country Code | Location | Period type | Year | IsLatestYear | Life Expectancy |
|---|---|---|---|---|---|---|
| 0 | LSO | Lesotho | Year | 2021 | True | 51.48 |
| 1 | CAF | Central African Republic | Year | 2021 | True | 52.31 |
| 2 | SOM | Somalia | Year | 2021 | True | 53.95 |
| 3 | SWZ | Eswatini | Year | 2021 | True | 54.59 |
| 4 | MOZ | Mozambique | Year | 2021 | True | 57.66 |
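An alternative to listing 28 columns to drop is to list the few columns to keep. The sketch below shows that variant on a tiny stand-in for the WHO export; the rows and codes are made up, and this is only one possible way to write the step, not the approach used above:

```python
import pandas as pd

# Small stand-in for the WHO export (made-up rows, only a few of the 34 columns)
who = pd.DataFrame({
    "SpatialDimValueCode": ["XAA", "XBB"],  # hypothetical codes
    "Location": ["Country A", "Country B"],
    "Period": [2021, 2021],
    "FactValueNumeric": [51.48, 84.46],
    "Language": ["EN", "EN"],
    "FactComments": [None, None],
})

# Map each column to keep onto its new, friendlier name
keep = {
    "SpatialDimValueCode": "Country Code",
    "Location": "Location",
    "Period": "Year",
    "FactValueNumeric": "Life Expectancy",
}

# Select first, then rename; .copy() leaves the raw frame untouched
who_cleaned = who[list(keep)].rename(columns=keep).copy()
print(who_cleaned)
```

Selecting instead of dropping also avoids modifying the raw dataframe in place, so the original export stays available for later checks.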
Filtering for the year 2021
Life Expectancy Data
As mentioned previously, I chose 2021 for my analysis, so I filtered the life expectancy data down to that year.
# Filtering the data for the year 2021
life_exp_2021 = life_exp_cleaned[life_exp_cleaned["Year"] == 2021]
I also sort the data in descending order of life expectancy, so that the countries with the highest life expectancy are at the top.
# Sorting the data by Life Expectancy in descending order
life_exp_2021_sorted = life_exp_2021.sort_values(by="Life Expectancy", ascending=False)
life_exp_2021_sorted
| | Country Code | Location | Period type | Year | IsLatestYear | Life Expectancy |
|---|---|---|---|---|---|---|
| 184 | JPN | Japan | Year | 2021 | True | 84.46 |
| 183 | SGP | Singapore | Year | 2021 | True | 83.86 |
| 182 | KOR | Republic of Korea | Year | 2021 | True | 83.80 |
| 181 | CHE | Switzerland | Year | 2021 | True | 83.33 |
| 180 | AUS | Australia | Year | 2021 | True | 83.10 |
| ... | ... | ... | ... | ... | ... | ... |
| 4 | MOZ | Mozambique | Year | 2021 | True | 57.66 |
| 3 | SWZ | Eswatini | Year | 2021 | True | 54.59 |
| 2 | SOM | Somalia | Year | 2021 | True | 53.95 |
| 1 | CAF | Central African Republic | Year | 2021 | True | 52.31 |
| 0 | LSO | Lesotho | Year | 2021 | True | 51.48 |
185 rows × 6 columns
The top rows now list the countries with the highest life expectancy in 2021. This makes it easy to see which countries lead in health and overall longevity before I merge with the GDP per capita data to test the hypothesis.
GDP Per Capita Data
I then do the same with the cleaned GDP dataset: filter for 2021 and sort in descending order of GDP per capita, so that the countries with the highest GDP per capita are at the top.
# Filtering the data for the year 2021
gdp_cleaned_2021 = gdp_cleaned[gdp_cleaned["Year"] == 2021]
# Sorting the data by GDP per capita in descending order
gdp_cleaned_2021_sorted = gdp_cleaned_2021.sort_values(by="GDP per Capita", ascending=False)
gdp_cleaned_2021_sorted
| | Country Name | Country Code | Year | GDP per Capita |
|---|---|---|---|---|
| 2300 | Monaco | MCO | 2021 | 235132.7842 |
| 2283 | Liechtenstein | LIE | 2021 | 197504.5489 |
| 2285 | Luxembourg | LUX | 2021 | 133711.7944 |
| 2191 | Bermuda | BMU | 2021 | 114274.6220 |
| 2262 | Ireland | IRL | 2021 | 102001.7982 |
| ... | ... | ... | ... | ... |
| 2272 | Korea, Dem. People's Rep. | PRK | 2021 | NaN |
| 2315 | Northern Mariana Islands | MNP | 2021 | NaN |
| 2347 | South Sudan | SSD | 2021 | NaN |
| 2380 | Venezuela, RB | VEN | 2021 | NaN |
| 2384 | Yemen, Rep. | YEM | 2021 | NaN |
217 rows × 4 columns
The top rows now list the countries with the highest GDP per capita in 2021, while countries with no 2021 value (shown as NaN) sort to the bottom. This makes it easy to see which countries lead in economic strength before I merge with the life expectancy data to test the hypothesis.
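Because some countries have no GDP value for 2021, it can help to pull out the top entries and the missing ones explicitly rather than reading them off a sorted table. A short sketch with made-up numbers and hypothetical country codes:

```python
import pandas as pd

# Made-up 2021 slice; None stands for a country with no reported value
gdp_2021 = pd.DataFrame({
    "Country Code": ["XAA", "XBB", "XCC", "XDD"],
    "GDP per Capita": [42000.0, None, 900.0, 135000.0],
})

# nlargest skips missing values and returns rows already sorted
top2 = gdp_2021.nlargest(2, "GDP per Capita")

# Countries that would silently fall out of any GDP-based plot
missing_codes = gdp_2021.loc[gdp_2021["GDP per Capita"].isna(), "Country Code"].tolist()

print(top2)
print(missing_codes)
```

Listing the missing codes up front makes it clear which countries cannot appear in the scatter plot later, even if they have a life expectancy value.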
Step 3: Data Analysis
Individual Data Analysis & Visualization
GDP per Capita
Next, I created individual visualizations for the 2021 GDP per capita data and life expectancy data to analyze each dataset on its own before merging them. For the GDP data, I went with a choropleth map because I wanted to capture the spatial distribution of wealth across the globe. Seeing the data on a world map helps highlight regional patterns and pinpoint clusters of high and low GDP, which is super useful for understanding economic disparities.
# Creating a choropleth map for GDP per Capita in 2021
fig = px.choropleth(
gdp_cleaned_2021,
locations="Country Code", # ISO country codes
color="GDP per Capita", # Color scale based on GDP per Capita
hover_name="Country Name", # Hover shows country name
title="Worldwide GDP per Capita in 2021",
color_continuous_scale="Blues", # Use a blue color scale for visual effect
labels={"GDP per Capita": "GDP per Capita (USD)"}
)
# Enforcing the range for the color axis
fig.update_layout(
coloraxis_colorbar=dict(title="GDP per Capita"),
coloraxis=dict(cmin=5000, cmax=90000), # Force color scale range
geo=dict(
showcoastlines=True,
coastlinecolor="Black",
showland=True,
landcolor="LightGray",
showcountries=True,
countrycolor="Black",
projection_type="natural earth",
),
height=600,
margin={"r": 0, "t": 50, "l": 0, "b": 0},
)
# Displaying the map
fig.show()
The choropleth map of GDP per capita in 2021 shows a stark global disparity in economic wealth. Darker shades, representing higher GDP per capita, are concentrated in North America, Western Europe, and select regions in Asia and Oceania. Countries such as the United States, Canada, Luxembourg, Norway, and Singapore stand out prominently, reflecting their strong economies and high living standards. These are likely highly developed countries with access to world-class healthcare, education, and infrastructure. Highly productive key sectors like mining, finance, and tourism also play an important role. However, income may still be unevenly distributed within these countries despite the high average figure.
In contrast, much of Sub-Saharan Africa and parts of South Asia are depicted in lighter shades, indicating significantly lower GDP per capita. This highlights persistent economic challenges in these regions, which often face issues like underdeveloped infrastructure, limited access to resources, and political instability.
Additionally, areas such as the Middle East, particularly resource-rich countries like Qatar and the UAE, likely demonstrate high GDP per capita, although their smaller geographical representation on the map makes it less visually striking.
Life Expectancy
For life expectancy, I created a bar chart to get a clear view of the distribution by country. This approach makes it easier to compare countries directly and spot outliers or trends, like which nations are leading the charge in longevity and which are falling behind. Breaking it down this way allows for a deeper understanding of each dataset before combining them for the bigger picture!
# Create a bar chart of life expectancy by country
fig_countries = px.bar(
life_exp_2021,
x="Location", # Country names
y="Life Expectancy", # Life expectancy values
color="Country Code", # One color per country
title="Life Expectancy Distribution by Country (2021)",
labels={"Life Expectancy": "Life Expectancy (Years)", "Country Code": "Country"},
)
fig_countries.update_layout(
xaxis_tickangle=45, # Rotate x-axis labels for better visibility
height=600
)
# Show the country-level distribution chart
fig_countries.show()
The bar chart provides a country-by-country breakdown of life expectancy in 2021, showcasing significant disparities across the globe. At the higher end of the spectrum, countries like Japan, Singapore, and Switzerland exhibit life expectancies above 80 years, reflecting advanced healthcare systems, robust social policies, and healthier lifestyles. These countries lead globally in longevity and are benchmarks for health and well-being.
In contrast, countries like Lesotho, the Central African Republic, and Somalia are positioned at the lower end, with life expectancies below 60 years. This highlights ongoing challenges such as limited healthcare access, malnutrition, and the impact of political instability and economic hardship.
The overall distribution emphasizes the stark divide between high-income nations and low-income nations regarding health outcomes. While many countries cluster around the global average of 70 to 75 years, the extremes highlight the crucial role of economic, social, and cultural factors in shaping life expectancy. This chart serves as a reminder of the pressing need to address inequalities in global health infrastructure and resources.
Merging the two Datasets
Next, I merged the GDP and life expectancy datasets using the Country Code and Year columns as the keys. These are the most logical connections since they represent the location and time dimensions of the data, and both datasets already use the same ISO 3-letter country codes after the cleaning and pre-processing step.
To keep things clean and accurate, I explicitly specified the key columns in the merge function. This step ensured that every country's GDP per capita was aligned with its corresponding life expectancy for 2021. The result? A unified dataset that brings together economic and health indicators, which is exactly what I need to dive deeper into the analysis. Let's see what insights this powerhouse combo reveals!
final_merged_2021 = pd.merge(
life_exp_2021,
gdp_cleaned_2021,
on=["Country Code", "Year"],
how="inner"
)
final_merged_2021
| | Country Code | Location | Period type | Year | IsLatestYear | Life Expectancy | Country Name | GDP per Capita |
|---|---|---|---|---|---|---|---|---|
| 0 | LSO | Lesotho | Year | 2021 | True | 51.48 | Lesotho | 1054.932740 |
| 1 | CAF | Central African Republic | Year | 2021 | True | 52.31 | Central African Republic | 461.137511 |
| 2 | SOM | Somalia | Year | 2021 | True | 53.95 | Somalia | 576.523678 |
| 3 | SWZ | Eswatini | Year | 2021 | True | 54.59 | Eswatini | 4068.573790 |
| 4 | MOZ | Mozambique | Year | 2021 | True | 57.66 | Mozambique | 504.037759 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 180 | AUS | Australia | Year | 2021 | True | 83.10 | Australia | 60697.245440 |
| 181 | CHE | Switzerland | Year | 2021 | True | 83.33 | Switzerland | 93446.434450 |
| 182 | KOR | Republic of Korea | Year | 2021 | True | 83.80 | Korea, Rep. | 35125.522500 |
| 183 | SGP | Singapore | Year | 2021 | True | 83.86 | Singapore | 79601.412960 |
| 184 | JPN | Japan | Year | 2021 | True | 84.46 | Japan | 40058.537330 |
185 rows × 8 columns
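An inner merge keeps only the codes found in both tables, and a matched row can still carry a missing GDP value. A hedged sketch of how merge coverage could be audited, using made-up frames with hypothetical codes (the indicator and validate options are standard pandas merge parameters):

```python
import pandas as pd

# Made-up inputs: XCC exists only in the life table, XDD only in the GDP table,
# and XBB matches but has no GDP value
life = pd.DataFrame({
    "Country Code": ["XAA", "XBB", "XCC"],
    "Year": [2021, 2021, 2021],
    "Life Expectancy": [51.5, 84.5, 70.0],
})
gdp = pd.DataFrame({
    "Country Code": ["XAA", "XBB", "XDD"],
    "Year": [2021, 2021, 2021],
    "GDP per Capita": [1000.0, None, 5000.0],
})

# An outer merge with indicator=True labels where each row came from
check = pd.merge(life, gdp, on=["Country Code", "Year"], how="outer", indicator=True)
only_in_life = check.loc[check["_merge"] == "left_only", "Country Code"].tolist()
only_in_gdp = check.loc[check["_merge"] == "right_only", "Country Code"].tolist()

# validate raises an error if a country-year key is duplicated on either side
merged = pd.merge(life, gdp, on=["Country Code", "Year"], how="inner", validate="one_to_one")

# Rows usable for a GDP vs life expectancy comparison
complete = merged.dropna(subset=["Life Expectancy", "GDP per Capita"])
print(only_in_life, only_in_gdp, len(merged), len(complete))
```

The length of the complete frame, rather than the length of the merged frame, is the number of countries that actually contribute points to a GDP versus life expectancy plot.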
Next, I save final_merged_2021 as a CSV file in case I need it for future analysis, and so I can quickly check all the columns and rows before building the merged visualizations.
final_merged_2021.to_csv("final_merged_2021.csv", index=False)
Step 4: Final Visualizations on the merged data
Final Visualization:
Scatter Plots: With a scatter plot, it is easy to visually compare multiple countries and see where they fall on the spectrum of wealth and life expectancy, adding depth to the interpretation of the data.
Choropleth Map: Storytelling with geography. The choropleth map adds a geographical layer to the analysis, showing how wealth and life expectancy are distributed globally, and potentially highlighting areas that are outperforming or lagging behind their neighbors.
The Scatter Plots
# Creating a scatter plot to analyze GDP per capita vs Life Expectancy in 2021
fig = px.scatter(
final_merged_2021, # Using the merged dataset
x="GDP per Capita", # GDP per capita on x-axis
y="Life Expectancy", # Life expectancy on y-axis
color="Country Code", # Different colors for each country
animation_frame="Year", # Animation over time (year)
hover_name="Country Name", # Hover to show country names
title="Relationship Between GDP per Capita and Life Expectancy in 2021",
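trendline="ols", # Ordinary least squares trend line
trendline_scope="overall", # One trend line across all countries, not one per color group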
labels={
"GDP per Capita": "GDP per Capita (USD)",
"Life Expectancy": "Life Expectancy (Years)",
"Country Code": "Country",
},
)
# Show the plot
fig.show()
# Keeping only the Country Code, Life Expectancy, and GDP per Capita columns
highest_gdplifeexp = final_merged_2021.filter(
["Country Code", "Life Expectancy", "GDP per Capita"], axis=1
)
# Resetting the index to make it cleaner
highest_gdplifeexp = highest_gdplifeexp.reset_index(drop=True)
# Displaying the first few rows of the filtered dataset
highest_gdplifeexp.head()
| Country Code | Life Expectancy | GDP per Capita | |
|---|---|---|---|
| 0 | LSO | 51.48 | 1054.932740 |
| 1 | CAF | 52.31 | 461.137511 |
| 2 | SOM | 53.95 | 576.523678 |
| 3 | SWZ | 54.59 | 4068.573790 |
| 4 | MOZ | 57.66 | 504.037759 |
The Choropleth Map
import plotly.express as px
# Creating a choropleth map of Life Expectancy in 2021, with GDP per Capita shown on hover
fig = px.choropleth(
final_merged_2021, # Using the merged dataset
locations="Country Code", # ISO country codes for mapping
color="Life Expectancy", # Color scale based on Life Expectancy
hover_name="Country Name", # Show country names on hover
hover_data={"GDP per Capita": True, "Life Expectancy": True}, # Include GDP per Capita in hover info
animation_frame="Year", # Animation over time (year)
title="Choropleth: Global Relationship Between GDP per Capita and Life Expectancy in 2021",
labels={
"Life Expectancy": "Life Expectancy (Years)",
"GDP per Capita": "GDP per Capita (USD)",
"Country Code": "Country",
},
color_continuous_scale="Viridis", # A diverse color scale for better contrast
)
# Updating the layout for a clean and interactive display
fig.update_layout(
height=600,
margin={"r": 0, "t": 50, "l": 0, "b": 0},
geo=dict(
projection_type="natural earth",
showcoastlines=True,
coastlinecolor="Black",
showland=True,
landcolor="LightGray",
),
)
# Show the interactive map
fig.show()
Summarize key findings between two datasets: In 2021, does GDP per capita strongly correlate with life expectancy?
Step 5: Let's dive into the findings!
From the Scatter Plot
The scatter plot visualizes the relationship between GDP per capita and life expectancy for countries in 2021, showing a clear positive correlation. As GDP per capita increases, life expectancy generally rises, though the relationship is not perfectly linear. Countries with low GDP per capita cluster in the lower-left, where life expectancy tends to be below 70 years. However, beyond a certain GDP per capita threshold (around USD 40,000), the slope of the line tends to flatten as GDP per capita increases, indicating that the impact of higher GDP per capita on life expectancy diminishes at higher income levels. This is a common pattern, where initial increases in income significantly improve life quality and health outcomes, but the marginal impact lessens at higher income levels.
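The flattening described above is what a logarithmic relationship looks like, and it can be put into numbers by comparing correlations on raw and log-transformed GDP. The sketch below uses synthetic values constructed to follow a log curve exactly; they are not taken from the World Bank or WHO data, so the printed figures only illustrate the method:

```python
import numpy as np
import pandas as pd

# Synthetic values built so that life expectancy is exactly linear in log10(GDP);
# they are NOT real country data.
gdp = pd.Series([500, 1000, 5000, 20000, 60000, 100000], dtype=float)
life = 40 + 8 * np.log10(gdp)

r_linear = gdp.corr(life)            # Pearson correlation on raw GDP
r_log = np.log10(gdp).corr(life)     # Pearson correlation on log-transformed GDP
rho = gdp.rank().corr(life.rank())   # Spearman correlation, as Pearson on ranks

print(round(r_linear, 3), round(r_log, 3), round(rho, 3))
```

On the merged 2021 data, the same three numbers could be computed from the "GDP per Capita" and "Life Expectancy" columns after dropping rows with missing GDP. A log correlation noticeably higher than the raw one would support the diminishing-returns reading of the plot.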
However, there are a few outliers: countries with a relatively high GDP per capita (above USD 60,000) but a lower life expectancy than the general trend suggests, namely Qatar and the U.S. (both with a life expectancy of about 76 years). These countries could be affected by inequality or other social factors that hold life expectancy down despite high income levels.
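Outliers of this kind can also be flagged programmatically instead of by eye. One possible approach, sketched here on made-up numbers with hypothetical country codes, is to fit a straight line to life expectancy against log10 of GDP and look for high-income countries that fall below it. The USD 60,000 cutoff mirrors the text; the fitting method is an illustrative choice, not the one used for the plot:

```python
import numpy as np
import pandas as pd

# Made-up values; XEE plays the role of a rich country with below-trend longevity
df = pd.DataFrame({
    "Country Code": ["XAA", "XBB", "XCC", "XDD", "XEE", "XFF"],
    "GDP per Capita": [500.0, 2000.0, 10000.0, 40000.0, 65000.0, 70000.0],
    "Life Expectancy": [55.0, 63.0, 72.0, 81.0, 76.0, 83.0],
})

# Least-squares line of life expectancy on log10(GDP)
log_gdp = np.log10(df["GDP per Capita"])
slope, intercept = np.polyfit(log_gdp, df["Life Expectancy"], deg=1)

# Residual = observed minus fitted; negative means below the trend
df["Residual"] = df["Life Expectancy"] - (slope * log_gdp + intercept)

below_trend = df[(df["GDP per Capita"] > 60000) & (df["Residual"] < 0)]
print(below_trend[["Country Code", "Life Expectancy", "Residual"]])
```

Sorting the full table by Residual would give a ranked list of countries that do better or worse than their income level alone would suggest.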
From the Choropleth map
The choropleth map highlights a clear pattern in 2021: countries with higher GDP per capita tend to have higher life expectancy, represented by lighter shades on the map. This trend is most evident in regions like Western Europe, North America, and parts of Asia (e.g., Japan and Singapore), where economic prosperity aligns with access to advanced healthcare, better nutrition, and improved living conditions. Conversely, darker shades dominate in lower-income regions, particularly in Sub-Saharan Africa, where limited resources, weaker healthcare systems, and socioeconomic challenges contribute to shorter life expectancies. While the correlation is strong, it again does not imply a causal relationship between the variables. Notable exceptions like Japan (with a high life expectancy despite a comparatively moderate GDP per capita) emphasize the influence of cultural and lifestyle factors alongside economic wealth.
Here is the interesting part:
Even countries with sky-high GDP per capita (think over USD 100,000) are not showing drastically higher life expectancy compared to those sitting around USD 60,000 to USD 80,000. This just goes to show that money is not everything: factors like efficient healthcare, strong social systems, and healthy lifestyles clearly play a massive role.
On the flip side, countries like Japan, with a more moderate GDP per capita, manage to achieve incredible life expectancy, suggesting that cultural habits, diet, and healthcare quality can outweigh pure economic wealth. So, while GDP per capita is definitely a solid predictor of life expectancy, it's far from the whole story.
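The comparison between income ranges can be made concrete by grouping countries into GDP bands and averaging life expectancy within each band. A minimal sketch with made-up values; the band edges are illustrative choices that echo the USD 60,000 to 80,000 and USD 100,000 figures in the text:

```python
import pandas as pd

# Made-up values for eight hypothetical countries
df = pd.DataFrame({
    "GDP per Capita": [800.0, 3000.0, 15000.0, 45000.0, 65000.0, 75000.0, 110000.0, 130000.0],
    "Life Expectancy": [58.0, 66.0, 74.0, 81.0, 82.0, 83.0, 82.5, 83.5],
})

# Band edges in USD; each band includes its upper edge
bins = [0, 20000, 60000, 80000, float("inf")]
labels = ["< 20k", "20k-60k", "60k-80k", "> 80k"]
df["Income Band"] = pd.cut(df["GDP per Capita"], bins=bins, labels=labels)

# Average life expectancy per band
band_means = df.groupby("Income Band", observed=True)["Life Expectancy"].mean()
print(band_means)
```

If the real data behaves as described, the gap between the two highest bands should be small compared with the gap between the lowest band and the rest.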
Step 6: Conclusion
Based on the analysis of 179 countries in 2021, people tend to live longer in countries with a high GDP per capita. In this data, high-income countries do not show short life expectancies, and low-income countries do not show exceptionally long ones. However, the story doesn't end there: life expectancy can vary significantly even among countries with similar income levels. A likely explanation is how the wealth is distributed and, more importantly, how it's spent. Investments in healthcare, education, and social support make a real difference, suggesting that it's not just about having money but about using it wisely.
Proving the Hypothesis
The hypothesis proposed that countries with higher GDP per capita would have higher life expectancy due to better access to healthcare, nutrition, and living standards. The findings mostly support this hypothesis: higher GDP per capita generally correlates with longer life expectancy, confirming that wealth provides the foundation for better quality of life and healthcare infrastructure. High-income countries, on average, have better healthcare systems, nutrition, and living conditions, which all contribute to a longer lifespan.
However, this research also uncovered critical nuances. Wealth alone is insufficient to guarantee the highest life expectancy; the efficiency of healthcare, public policies, social cohesion, and lifestyle choices play major roles as well. Countries like Japan, with moderate GDP per capita but highly effective healthcare and cultural habits, consistently show high life expectancy, suggesting that the relationship is more complex than just wealth.
Answering the Research Question
"Is there a positive correlation between GDP per capita and life expectancy across countries in 2021, suggesting that economic prosperity leads to improved healthcare access, nutrition, and standards of living?"
The analysis indicates a positive correlation between GDP per capita and life expectancy, consistent with economic prosperity playing a role in improving living conditions, access to quality healthcare, and nutrition. However, the correlation is not linear or absolute. The data show that, beyond a certain threshold of wealth, life expectancy improvements appear to depend more on how the resources are used: investments in healthcare systems, societal factors, and lifestyle choices. Therefore, while higher GDP per capita generally goes with better life expectancy, true longevity also depends on a holistic approach that involves effective healthcare systems, quality of public services, and social well-being.
This deeper understanding not only supports the hypothesis but also highlights that effective policy-making and health-focused investments are crucial for achieving the best outcomes in life expectancy, regardless of GDP levels.
Site URL
Click here to see the published section